Abstract. We investigate layered neural networks with a differentiable activation
function and student vectors without a normalization constraint by means of equilibrium
statistical physics. We consider the learning of perfectly realizable rules and find that
the length of student vectors becomes infinite unless a proper weight decay term is
added to the energy. Then, the system undergoes a first order phase transition between
states with very long student vectors and states where the lengths are comparable to
those of the teacher vectors. Additionally, in both configurations there is a phase
transition between a specialized and an unspecialized phase. An anti-specialized phase
with long student vectors exists in networks with a small number of hidden units.



Statistical physics has been applied successfully to the investigation of equilibrium states
of neural networks [1], [2]. The by now standard analysis of off-line training from a fixed
training set is based on the interpretation of training as a stochastic process which
leads to a well-defined thermal equilibrium. Investigations of perceptrons [3], [4], [5] or
committee machines [6], [7], [8], [9], [10] have widely improved the understanding of learning in
neural networks. Meanwhile, these studies are being extended to the more application-relevant
scenario of networks with continuous activation function and output [11], [12], [13].

The soft-committee machine is a two-layered neural network which consists of a
layer of K hidden units, all of which are connected with the entire N-dimensional input
$\xi$. The total output $\sigma$ is proportional to the sum of the outputs of all hidden units:

$$\sigma(\xi) = \frac{1}{\sqrt{K}} \sum_{j=1}^{K} g(x_j) \quad \text{where} \quad x_j = \frac{1}{\sqrt{N}} \, J_j \cdot \xi \qquad (1)$$

where the weights of the j-th hidden unit are represented by the N-dimensional vector
$J_j$. We investigate learning of a perfectly matching rule parametrized by a teacher
network of the same architecture with output $\tau$ and orthogonal vectors $B_j$ which we
assume to have the length $\sqrt{N}$. The transfer function g(x) is taken to be a sigmoidal
function, e.g. the error function. Networks of this type have been studied in the limit
of high temperature [11], the annealed approximation [13], and by means of the replica
formalism [12]. All these studies imposed the simplifying condition that the order
parameters $Q_{ij} = J_i \cdot J_j / N$ are restricted to the value 1 for i = j, so the length of
the student vectors is fixed to that of the teacher vectors. This system shows a phase
transition between an unspecialized configuration, where the student-teacher overlaps
$R_{ij} = J_i \cdot B_j / N$ are identical for all i, j, and a specialized configuration where
$R_{ii} \neq R_{ij}$ for $i \neq j$. However, constraining the student lengths implies significant
a priori knowledge of the rule which is not available in practical applications. So, in this
paper we want to obtain first results for soft-committee machines which determine student
lengths in the course of learning.
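
The definitions of equation (1) and of the order parameters can be illustrated with a minimal sketch. It assumes the error-function choice g(x) = erf(x / sqrt(2)); the function names and the toy vectors are hypothetical and serve only as an illustration.

```python
import math

def g(x):
    """Sigmoidal transfer function; erf(x / sqrt(2)) is one admissible choice."""
    return math.erf(x / math.sqrt(2.0))

def hidden_fields(weights, xi):
    """Local fields x_j = J_j . xi / sqrt(N), one per hidden unit."""
    n = len(xi)
    return [sum(w * x for w, x in zip(j_vec, xi)) / math.sqrt(n) for j_vec in weights]

def soft_committee_output(weights, xi):
    """sigma(xi) = K**(-1/2) * sum_j g(x_j), cf. equation (1)."""
    k = len(weights)
    return sum(g(x) for x in hidden_fields(weights, xi)) / math.sqrt(k)

def overlap_matrix(a_vectors, b_vectors):
    """Overlaps a_i . b_j / N: Q_ij for (J, J) and R_ij for (J, B)."""
    n = len(a_vectors[0])
    return [[sum(a * b for a, b in zip(u, v)) / n for v in b_vectors] for u in a_vectors]

# Hypothetical toy example with N = 4 inputs and K = 2 hidden units.
teacher = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]  # orthogonal, length sqrt(N) = 2
student = [[1.0, 1.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]]  # no normalization imposed
xi = [1.0, -1.0, 0.5, 0.0]

Q = overlap_matrix(student, student)  # student-student overlaps
R = overlap_matrix(student, teacher)  # student-teacher overlaps
print(soft_committee_output(student, xi), Q, R)
```

Since the student vectors are not normalized here, the diagonal elements of Q differ from 1, which is exactly the situation studied in this paper.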

Learning is guided by the minimization of the training error 

$$\epsilon_t = \frac{1}{P} \sum_{\mu=1}^{P} \frac{1}{2} \left( \sigma(\xi^\mu) - \tau(\xi^\mu) \right)^2 \qquad (2)$$

where P is the number of examples used for training. After training, the success
of learning can be quantified by an average of the quadratic error measure over the
distribution of possible inputs, the so-called generalization error:

$$\epsilon_g = \left\langle \frac{1}{2} \left( \sigma(\xi) - \tau(\xi) \right)^2 \right\rangle_{\xi} \qquad (3)$$
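
The two error measures can be sketched as follows. The sketch assumes Gaussian inputs with zero mean and unit variance, the transfer function erf(x / sqrt(2)), and a prefactor 1/2 in the quadratic error; the generalization error is only estimated by sampling fresh inputs, and all names and parameter values are hypothetical.

```python
import math
import random

def output(weights, xi):
    """Soft-committee output K**(-1/2) * sum_j erf(J_j . xi / sqrt(2 N))."""
    n, k = len(xi), len(weights)
    fields = (sum(w * x for w, x in zip(j_vec, xi)) / math.sqrt(n) for j_vec in weights)
    return sum(math.erf(x / math.sqrt(2.0)) for x in fields) / math.sqrt(k)

def training_error(student, teacher, examples):
    """Equation (2): mean of (1/2) (sigma - tau)^2 over the P training inputs."""
    return sum(0.5 * (output(student, xi) - output(teacher, xi)) ** 2
               for xi in examples) / len(examples)

def sampled_generalization_error(student, teacher, n_samples, rng):
    """Equation (3), estimated by averaging over freshly drawn random inputs."""
    n = len(teacher[0])
    fresh = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n_samples)]
    return training_error(student, teacher, fresh)

rng = random.Random(0)
N, K, P = 10, 2, 50                                  # illustrative sizes only
teacher = [[math.sqrt(N) if i == j else 0.0 for i in range(N)] for j in range(K)]
student = [[0.0] * N for _ in range(K)]              # untrained student with sigma = 0
examples = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(P)]

eps_t = training_error(student, teacher, examples)
eps_g = sampled_generalization_error(student, teacher, 20000, rng)
print(eps_t, eps_g)
```

For the all-zero student the error reduces to half the mean squared teacher output, which provides a simple sanity check of the sampling estimate.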

Following the standard statistical physics approach, we consider a Gibbs ensemble,
which is characterized by the partition function
$Z = \int d\mu(\{J_j\}) \exp(-\beta H(\{J_j\}))$ with a
formal temperature $1/\beta$ which controls the thermal average of the energy in equilibrium.
The extensive energy H is a function of the training error, the standard choice being
$H = P \epsilon_t$. Typical equilibrium properties are calculated from the associated quenched
free energy $-(1/\beta) \langle \ln Z \rangle =: f N$, where the average is performed over the
random set of training examples. The evaluation of $\langle \ln Z \rangle$ in general requires
the rather involved replica formalism. To obtain first results we consider the simplifying
high-temperature limit $\beta \to 0$. The calculation of equilibrium states is guided by
minimization of $\beta f = \alpha K \epsilon_g - s$. Here $\alpha = \beta P / (N K)$ is the
rescaled number of examples, which we assume to be O(1), and s is the entropy per degree of
freedom with order parameters held fixed. The latter is given by

$$s = \frac{1}{2} \ln \det \underline{C} + \text{irrelevant const.} \qquad (4)$$

where $\underline{C}$ is the 2K x 2K matrix of all cross- and self-overlaps of student and
teacher vectors. Equation (4) is of quite general validity and can be derived by means of a
saddle point integration from the definition of the entropy. In [11] a simpler derivation is
presented.

Here we assume the components of all examples to be independent random numbers
with mean zero and unit variance. Then, in the thermodynamic limit $N \to \infty$ the
generalization error can be calculated analytically if we choose the activation function
$g(x) = \mathrm{erf}(x / \sqrt{2})$, which is very similar to the more popular hyperbolic
tangent, so the basic features of the model should not be altered:

$$\epsilon_g = \frac{1}{6} + \frac{1}{K \pi} \sum_{i,k=1}^{K} \left[ \sin^{-1} \frac{Q_{ik}}{\sqrt{1 + Q_{ii}} \sqrt{1 + Q_{kk}}} - 2 \sin^{-1} \frac{R_{ik}}{\sqrt{2 (1 + Q_{ii})}} \right] \qquad (5)$$
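
Equation (5) translates directly into a short function. The sketch below assumes the orthogonal teacher vectors of length $\sqrt{N}$ introduced above, so that the teacher-teacher overlaps form the unit matrix and give the constant 1/6; the function name is hypothetical.

```python
import math

def generalization_error(Q, R):
    """Equation (5) for teacher overlaps T_nm = delta_nm.

    Q[i][k] are student-student overlaps, R[i][k] student-teacher overlaps.
    """
    k_units = len(Q)
    total = 0.0
    for i in range(k_units):
        for k in range(k_units):
            total += math.asin(Q[i][k] / math.sqrt((1.0 + Q[i][i]) * (1.0 + Q[k][k])))
            total -= 2.0 * math.asin(R[i][k] / math.sqrt(2.0 * (1.0 + Q[i][i])))
    return 1.0 / 6.0 + total / (k_units * math.pi)

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(generalization_error(identity, identity))  # student identical to the teacher
print(generalization_error(identity, zeros))     # student uncorrelated with the teacher
```

The first call corresponds to perfect learning, for which the expression vanishes up to rounding, while the second describes a normalized student without any overlap with the teacher.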

In the following, we will first investigate the simplest case K = 1, i.e. a network
consisting of one single unit, to show the basic principles. Then we will study networks
with arbitrary K and finally investigate the limit $K \to \infty$ of very large networks.

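In the K = 1 case equations (5) and (4) read:

$$\epsilon_g = \frac{1}{6} + \frac{1}{\pi} \sin^{-1} \left( \frac{Q}{1 + Q} \right) - \frac{2}{\pi} \sin^{-1} \left( \frac{R}{\sqrt{2 (1 + Q)}} \right) \qquad (6)$$

$$s = \frac{1}{2} \ln \left( Q - R^2 \right) \qquad (7)$$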

Trying to minimize $\alpha \epsilon_g - s$, we find that $\epsilon_g$ remains of order 1 for
arbitrary R, Q while s becomes infinite for $Q \to \infty$, yielding $f \to -\infty$. This means
that in thermal equilibrium the length of the student vector increases to infinity, while its
overlap with the teacher becomes irrelevant. Of course, this is not the desired result of
training. The method of choice to avoid this behavior is to "punish" configurations with large
Q with an additional energy term called "weight decay". This is a method of regularization
which is widely used in practice in order to improve the generalization ability of feedforward
neural networks. So we introduce $H = P \epsilon_t + \lambda N Q$ [22] and obtain
$\beta f = \alpha \epsilon_g + \Lambda Q - s$ with $\Lambda = \beta \lambda$, which has to be
minimized w.r.t. R and Q. In Figure 1 we show $\epsilon_g$ as a function of the rescaled number
of examples $\alpha$ for $\Lambda = 0.001$. For small $\alpha$ the network is in a state with
large Q (and large $\epsilon_g$). For $\alpha > 12$ a second state with small Q and small
$\epsilon_g$ exists, which becomes globally stable at $\alpha \approx 15$. At
$\alpha \approx 21.6$ the state with large Q becomes even locally unstable. We remark that this
phase transition is solely due to the differentiable nature of the activation function, which
causes the energy to depend on the length of the student vector, and does not occur in the
simple perceptron. It was also not found for the simpler linear unit with g(x) = x, where
the training error is more sensitive to a mismatched Q than in the case of a bounded,
saturating transfer function. The phase transition disappears for $\Lambda > 0.006$.

Figure 1. Left panel: $\epsilon_g(\alpha)$ and $Q(\alpha)$ (inset) as obtained analytically
(solid line) and results of Monte-Carlo simulations (dots) for K = 1, $\Lambda = 0.001$ and
$\beta = 0.2$ (system size N = 100, averages over 5 runs with 10000 M.C. steps each, 5000 of
which were used for equilibration, 5000 for sampling measurements). We get two locally stable
states with different student lengths for some $\alpha$, depending on the starting value of
the student vector. We have used the same strategy as in to obtain the hysteresis
behaviour. Right panel: $\epsilon_g(\alpha)$ and $\Delta(\alpha)$ (inset) for K = 2 and
$\Lambda = 0.001$.
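
The effect of the weight decay on the single unit can be explored numerically with a minimal sketch based on equations (6) and (7). It minimizes the free energy over R at fixed Q by a golden-section search and then scans Q on a logarithmic grid. The grid range, the search routine and all function names are assumptions made for this illustration and are not taken from the calculation described above.

```python
import math

def beta_f(R, Q, alpha, weight_decay):
    """beta*f = alpha*eps_g + Lambda*Q - s for K = 1, from equations (6) and (7)."""
    eps_g = (1.0 / 6.0 + math.asin(Q / (1.0 + Q)) / math.pi
             - 2.0 * math.asin(R / math.sqrt(2.0 * (1.0 + Q))) / math.pi)
    entropy = 0.5 * math.log(Q - R * R)
    return alpha * eps_g + weight_decay * Q - entropy

def best_over_R(Q, alpha, weight_decay, iterations=60):
    """Golden-section search over 0 < R < sqrt(Q) at fixed Q; returns min of beta*f."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, math.sqrt(Q)
    for _ in range(iterations):
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if beta_f(left, Q, alpha, weight_decay) < beta_f(right, Q, alpha, weight_decay):
            hi = right
        else:
            lo = left
    return beta_f(0.5 * (lo + hi), Q, alpha, weight_decay)

def profile(alpha, weight_decay, n_grid=200, q_min=0.1, q_max=1e4):
    """Logarithmic grid in Q together with F(Q) = min_R beta*f(R, Q)."""
    qs = [q_min * (q_max / q_min) ** (i / (n_grid - 1)) for i in range(n_grid)]
    return qs, [best_over_R(q, alpha, weight_decay) for q in qs]

def interior_minima(qs, values):
    """Grid points that lie strictly below both of their neighbours."""
    return [(qs[i], values[i]) for i in range(1, len(qs) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]

alpha = 15.0
qs, unregularized = profile(alpha, weight_decay=0.0)
qs, regularized = profile(alpha, weight_decay=0.001)
print("no weight decay, F at the two largest Q:", unregularized[-2], unregularized[-1])
for q, value in interior_minima(qs, regularized):
    print(f"local minimum near Q = {q:.1f} with beta*f = {value:.3f}")
```

Without weight decay the profile is still decreasing at the largest Q on the grid, in line with the divergence of the student length. With $\Lambda = 0.001$ and $\alpha = 15$ the scan is meant to expose the coexistence of a small-Q and a large-Q state; repeating it for other values of $\alpha$ gives a rough picture of where each state exists.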

We have performed continuous Monte-Carlo simulations of a Metropolis-like learning
process of the single unit. The results shown in Figure 1 confirm our theoretical results.
In order to extend our analysis to networks with $K \geq 2$ we assume the network
configuration to be site-symmetric with respect to the hidden units, so the order
parameters fulfill the conditions $R_{ij} = R \delta_{ij} + S (1 - \delta_{ij})$ and
$Q_{ij} = Q \delta_{ij} + C (1 - \delta_{ij})$.
This assumption reflects the symmetry of the rule yet allows for specialization of the
student, as student overlaps with teacher vectors can yield different values for i = j and
$i \neq j$. Now generalization error and entropy read:



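$$\epsilon_g = \frac{1}{6} + \frac{1}{\pi} \sin^{-1} \left( \frac{Q}{1 + Q} \right) + \frac{K - 1}{\pi} \sin^{-1} \left( \frac{C}{1 + Q} \right) - \frac{2}{\pi} \sin^{-1} \left( \frac{R}{\sqrt{2 (1 + Q)}} \right) - \frac{2 (K - 1)}{\pi} \sin^{-1} \left( \frac{S}{\sqrt{2 (1 + Q)}} \right) \qquad (8)$$

$$s = \frac{1}{2} \ln \left[ (K - 1) C + Q - \left( R + (K - 1) S \right)^2 \right] + \frac{K - 1}{2} \ln \left[ Q - C - (R - S)^2 \right] \qquad (9)$$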
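
The site-symmetric expressions can be cross-checked numerically. The sketch below evaluates equations (8) and (9) and compares the closed-form entropy with one half of the logarithm of the determinant of the full 2K x 2K overlap matrix from equation (4). The teacher block is taken to be the unit matrix; the helper names and the test values are hypothetical.

```python
import math

def eps_g_symmetric(R, S, Q, C, K):
    """Equation (8): generalization error in a site-symmetric state."""
    norm = 1.0 + Q
    return (1.0 / 6.0
            + (math.asin(Q / norm) + (K - 1) * math.asin(C / norm)) / math.pi
            - 2.0 * (math.asin(R / math.sqrt(2.0 * norm))
                     + (K - 1) * math.asin(S / math.sqrt(2.0 * norm))) / math.pi)

def entropy_symmetric(R, S, Q, C, K):
    """Equation (9): (1/2) ln det of the overlap matrix in closed form."""
    return (0.5 * math.log(Q + (K - 1) * C - (R + (K - 1) * S) ** 2)
            + 0.5 * (K - 1) * math.log(Q - C - (R - S) ** 2))

def log_det(matrix):
    """ln det of a positive definite matrix by Gaussian elimination without pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for col in range(n):
        total += math.log(a[col][col])
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= factor * a[col][j]
    return total

def full_overlap_matrix(R, S, Q, C, K):
    """2K x 2K matrix with blocks [[Q_ij, R_ij], [R_ji, delta_ij]]."""
    def sym(diag, off):
        return [[diag if i == j else off for j in range(K)] for i in range(K)]
    q_block, r_block, t_block = sym(Q, C), sym(R, S), sym(1.0, 0.0)
    top = [q_block[i] + r_block[i] for i in range(K)]
    bottom = [[r_block[j][i] for j in range(K)] + t_block[i] for i in range(K)]
    return top + bottom

K, R, S, Q, C = 3, 0.8, 0.1, 1.2, 0.2   # arbitrary values with a positive definite matrix
closed_form = entropy_symmetric(R, S, Q, C, K)
from_determinant = 0.5 * log_det(full_overlap_matrix(R, S, Q, C, K))
print(closed_form, from_determinant, eps_g_symmetric(R, S, Q, C, K))
```

For K = 2 both functions are unchanged when R and S are exchanged, which can be verified by swapping the first two arguments.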



The weight decay term introduced for the single unit generalizes naturally to
$\Lambda \sum_{i=1}^{K} Q_{ii}$, so the free energy becomes
$\beta f = \alpha K \epsilon_g + \Lambda K Q - s$ in a site-symmetric state. Numerical
minimization leads to the results shown in Figures 1 and 2 for K = 2 and K = 3. In
addition to the first order phase transition already observed for the single unit, which
connects states with different lengths of student vectors, we observe transitions between
phases which are characterized by the parameter $\Delta := R - S$ indicating specialization
features. As both transitions are due to independent mechanisms, namely on the one
hand a change of student vector lengths and on the other hand an alteration of their
directions, specialized ($\Delta > 0$) and unspecialized ($\Delta = 0$) phases can exist both
in the large-Q configuration and in the small-Q regime. Indeed, for $K \geq 3$ first order
transitions between specialized and unspecialized phases can be observed in both
configurations. Additionally, there is a second order phase transition between the
unspecialized large-Q phase and an anti-specialized phase ($\Delta < 0$) with large Q at
$\alpha \approx 15$. The K = 2 system shows a second order transition in the large-Q regime,
while an unspecialized configuration with small Q cannot be observed. This difference in
behavior results from the higher degree of symmetry in the K = 2 system, where the free
energy is invariant under exchange of R and S. Consequently there is no physical difference
between specialized and anti-specialized configurations in the K = 2 system.

Figure 2. Left: $\epsilon_g(\alpha)$ for K = 3, $\Lambda = 0.001$ and $\Delta(\alpha)$ (inset).
Different starting values were used in numerical minimization to calculate as many local
minima of the free energy as possible. Right: $K = \infty$, $\Lambda = 0.0001$. All local
minima of the free energy have been calculated from the saddle point equations.

To study the behaviour of very large networks ($K \to \infty$), scaling assumptions for the
order parameters have to be made. Supposing C to be O(1), the output of the student
would be $O(\sqrt{K})$ and thus on a different scale than the teacher output. So we assume the
hidden unit overlaps to be O(1/K), writing $C = \hat{C} / (K - 1)$, and further introduce
$S = \hat{S} / K$, while $\Delta$ and Q remain O(1). Inserting this and performing
$\lim_{K \to \infty} \beta f / K$, we find that the condition $\partial f / \partial \hat{S} = 0$
can be fulfilled only if $Q + \hat{C} - (\Delta + \hat{S})^2$ is assumed to be O(1/K). So we
substitute $\hat{C} = \tilde{C} / K + (\Delta + \hat{S})^2 - Q$ before performing
the limit $K \to \infty$. The corresponding generalization error is shown in Figure 2 as
a function of $\alpha$. For small $\alpha$, the network is in an unspecialized phase with large
Q. At $\alpha \approx 13$ a locally stable, unspecialized configuration with small Q appears,
which is globally stable between $\alpha \approx 22$ and $\alpha \approx 88$, where the
specialized small-Q configuration becomes globally stable. However, the unspecialized
configuration remains locally stable. Additionally, at $\alpha \approx 20$ the specialized
large-Q phase appears, the free energy of which is smaller than that of the unspecialized
large-Q phase for $\alpha > 22.5$. Anti-specialized configurations do not exist in the limit
$K \to \infty$. We expect them to be a characteristic feature of systems with small
$K \geq 3$.

In summary, we have shown by means of statistical physics that learning an unknown 
rule without a priori knowledge in the form of normalized student vectors leads to a much 
more complicated behaviour than learning with normalized students. The number of 
phases in which the system can exist increases. Further, student lengths tend to infinity 
unless the network weights are regularized by means of a proper weight decay. 

Further investigations will extend research to finite temperatures by applying the 
replica formalism and study the relevance of our results for practical training processes. 